Erasure coding and replication in storage clusters

ABSTRACT

A cluster receives a request to store an object using replication or erasure coding. The cluster writes the object using erasure coding. A manifest is written that includes an indication of erasure coding and a unique identifier for each segment. The cluster returns a unique identifier of the manifest. The cluster receives a request from a client that includes a unique identifier. The cluster determines whether the object has been stored using replication or erasure coding. If using erasure coding, the method reads a manifest. The method identifies segments within the cluster using unique segment identifiers of the manifest. Using these unique segment identifiers, the method reconstructs the object. A persistent storage area of another disk is scanned to find a unique identifier of a failed disk. If using erasure coding, a missing segment previously stored on the disk is identified. The method locates other segments. Missing segments are regenerated.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 13/517,523 filed on the same date herewith, entitled “TWO LEVEL ADDRESSING IN STORAGE CLUSTERS” and to U.S. patent application Ser. No. 13/517,525 filed on the same date herewith, entitled “ELIMINATION OF DUPLICATES IN STORAGE CLUSTERS,” both of which are hereby incorporated by reference.

FIELD OF THE INVENTION

The present invention relates generally to erasure coding. More specifically, the present invention relates to a combination of erasure coding and replication in a fixed-content storage cluster and to volume failure recovery when using erasure coding.

BACKGROUND OF THE INVENTION

Traditionally, storage provided in a storage cluster (such as by using a redundant array of independent nodes, or RAIN) is made reliable against hardware failure either through replication of stored objects or erasure coding of stored objects. The former has the advantage that the same unique identifier can access the multiple replicas (using a journal and RAM-based indexing scheme, for example), but has the disadvantage of high bandwidth and storage overhead (depending upon the number of replicas desired, large objects can take up a significant amount of space). The latter enjoys the benefit of a smaller storage footprint and less overhead for a similar level of protection against media failures, but suffers from the drawback that each segment of an erasure set is different content that must be separately identified in order to read the object or to reconstruct any lost segments. This identification can be especially problematic when a storage cluster is restarted. Erasure coding will also incur a higher processing overhead and lose its footprint advantage when storing small objects.

Thus, both techniques have disadvantages. Further, some prior art approaches applicable to erasure coding use a control database separate from the storage cluster in order to identify and track segments of a particular object; this approach is problematic because it introduces more overhead and calls into question the availability of this control database and whether or not it needs to be replicated. Also, even though under erasure coding an object can be reconstructed using a subset of the segments used to encode that object (e.g., if there had been a disk failure), it can be time consuming not only to identify which segments are no longer present, but also to locate the remaining segments.

Accordingly, improved techniques are desired for use with storage clusters in order to take advantage of the benefits of replication and erasure coding as well as to limit exposure after a hardware failure.

SUMMARY OF THE INVENTION

To achieve the foregoing, and in accordance with the purpose of the present invention, a storage cluster is disclosed that combines both object replication and erasure coding in order to utilize the advantages of each.

In one embodiment, a method stores a digital object in a storage cluster. First, the storage cluster receives from a client application a request at a computer node of the storage cluster to store a digital object. The storage cluster then determines whether to store the digital object in the storage cluster using replication or erasure coding. This determination may be made by reference to an instruction from the client, an inherent property of the object, the metadata of the object, a setting of the cluster, or by other means. When it is determined to store the digital object using erasure coding, the storage cluster writes the digital object to a number of computer nodes of the storage cluster using erasure coding and the digital object is stored as a plurality of segments. In addition, a manifest computer file is created that includes an indication of the erasure coding and a unique identifier within the storage cluster for each of the segments. The storage cluster then stores the manifest computer file on one or more nodes of the cluster and returns a unique identifier of the manifest computer file to the client application. Manifests are distinguishable from other digital objects.

In another embodiment, a method reads a digital object from a storage cluster having a plurality of computer nodes. First, one of the computer nodes within the storage cluster receives a request from a client application that includes a unique identifier for the digital object. The storage cluster finds the object so identified on one of the nodes where it is stored. If the object is not a manifest as described above, it is returned to the client application. If the object is a manifest, the method next identifies a plurality of segments within the storage cluster using unique segment identifiers found within the manifest. Using these unique segment identifiers, the method reconstructs the digital object using the segments and an erasure coding algorithm. Finally, the method returns the digital object to the client application.

In another embodiment, where the client application wishes to replace the content associated with a unique identifier with a new version of the content, the invention behaves as described in the previous paragraph, to find the object, then as in the paragraph prior, to write it. The new version will retain the previous version's unique identifier, but have a later creation timestamp, which distinguishes the two versions during the update process. The older version is deleted once the newer version is successfully written in the cluster. The health processing module may also delete older versions of objects for which newer versions are present, as a way of maintaining cluster data from error states.

In another embodiment, a method recovers from a failed disk. First, the method detects within a storage cluster (having a plurality of computer nodes) that a disk of one of the nodes has failed. Next, the method scans a persistent storage area of another disk of the storage cluster in order to find a unique identifier of the failed disk; this unique identifier is in association with a digital stream of the storage cluster. It is then determined whether the digital stream is stored within the storage cluster using replication or erasure coding; when it is determined that the digital stream is stored using erasure coding, the method identifies a missing segment previously stored on the failed disk. Using sibling identifiers from the digital stream, the method locates a plurality of other segments within the storage cluster. Next, the method regenerates the missing segment using a number of the other segments and an erasure coding algorithm. Finally, the method stores the regenerated segment on a computer node of the storage cluster.

In another embodiment, a segment may be relocated within a storage cluster without the need for an extra control computer or control database. When a segment is relocated, the volume hint in its upstream sibling segment is updated in the metadata of the sibling segment. Alternatively, the volume hint is updated in the stream representation of the sibling segment in its journal on disk. The storage cluster may broadcast the unique identifier of the sibling segments in order to find the disk where the sibling segment is located. The manifest may also be used to find an upstream segment or a sibling segment of the segment that is being relocated.

In other embodiments, metadata stored with an object or within cluster settings dictates when an object should be converted to a different format. When a triggering condition is satisfied, the cluster converts the object from a storage using replication to an erasure coding format, from one erasure coding format to another, or from an erasure coding format to a replication storage format. The original object in its old format may be deleted if desired. Advantageously, the unique identifier used with the original object is retained for use with the object in its new storage format, thus ensuring that the client application that originally stored the object may retrieve it at any future date using the original unique identifier with which it was provided.

In another embodiment, an object may be moved from one storage cluster implementing one storage format (e.g., replication, the specific erasure coding, etc.) to a second storage cluster that does not necessarily implement the same storage format. When moved, the object will be automatically converted into the storage format used by the second cluster. Conversion of the object may be dictated by default settings in the second storage cluster, by user metadata of the object (which trumps cluster settings), or by an instruction from a program initiating the move.

In general, an extra control database is not needed with any of the embodiments of the present invention. Inherently, a digital object stored within the cluster can be written, read and managed using its unique identifier, irrespective of whether the object is stored using replication or erasure coding.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:

FIG. 1 illustrates an environment for operation of the present invention.

FIG. 2 illustrates an example erasure set for an object using 5:7 erasure coding.

FIG. 3 presents an example manifest that may be used in embodiments of the invention.

FIGS. 4A and 4B are a flow diagram describing how a client application writes a file to the storage cluster.

FIG. 5 is a flow diagram describing how a client application reads a digital object from the storage cluster.

FIG. 6 is a flow diagram describing how a storage cluster may recover from a volume failure.

FIG. 7 is a flow diagram describing how an object may be converted from one format to another.

FIG. 8 is a flow diagram describing how management across storage clusters may be performed.

FIGS. 9A and 9B illustrate a computer system suitable for implementing embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

As is known in the art, erasure coding is a technique for providing redundancy of a data object without the overhead of replication. Given a particular data object, erasure coding breaks the object into K data segments and from those data segments generates P parity segments, for a total of M segments in an erasure set, commonly indicated as a K:M erasure code. For example, a data object broken into 5 segments which are then used to generate 2 parity segments is referred to as using a 5:7 erasure code. A key property of erasure codes is that the original object can be reconstructed from any K segments, whether a segment of an erasure-coded object is an original data segment or one of the parity segments. Accordingly, it is advantageous to distribute each segment to different volumes within a storage cluster (and to different nodes), thus protecting the data object against the loss of any P volumes within the cluster (any two volumes in the 5:7 example). Given adequate nodes in the cluster, segments are distributed to different nodes, providing protection against node loss. If nodes are located in different geographic regions, segments are distributed evenly among the regions to protect, to the extent possible, against loss of the portion of a cluster in one geographic locale.

In one embodiment, the invention allows objects to be stored either using replication or erasure coding. The cluster may switch back and forth on a per object basis depending upon instructions from a client application, an object's category, an object's size, an object's metadata, etc. For example, relatively large objects may all be stored using erasure coding, while relatively small objects may all be stored using replication. Ordinary replication may be viewed as a special case of erasure coding where K=1 and where M is equal to the total number of replicas of a given object. Also, different objects may be assigned different erasure codings. In the case of very large objects, several such erasure sets may be used to represent the object. By allowing different objects to be assigned different encodings, including ordinary replication, the invention allows different levels of protection against data loss, at different processing costs and storage footprints.
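The K:M notation above lends itself to a short worked calculation. The following Python sketch is illustrative only; the function name `erasure_profile` and its dictionary keys are assumptions made for this example and do not appear in the described system. It derives the parity count P = M − K, the number of segment losses that can be tolerated, and the storage footprint relative to the original object, and it treats ordinary replication as the K=1 case.

```python
from fractions import Fraction


def erasure_profile(k: int, m: int) -> dict:
    """Summarise a K:M erasure code (K data segments, M total segments)."""
    if not 0 < k <= m:
        raise ValueError("need 0 < K <= M")
    p = m - k  # number of parity segments
    return {
        "data_segments": k,
        "parity_segments": p,
        "total_segments": m,
        # Any K of the M segments suffice, so up to P losses are survivable.
        "tolerated_losses": p,
        # Bytes stored divided by bytes in the original object.
        "storage_factor": Fraction(m, k),
    }


five_seven = erasure_profile(5, 7)       # the 5:7 example from the text
three_replicas = erasure_profile(1, 3)   # replication viewed as K=1, M=3
print(five_seven["storage_factor"], three_replicas["storage_factor"])
```

Under these definitions both profiles survive the loss of two segments, but the 5:7 code stores 7/5 of the original size while three replicas store three times the original size.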

In a second embodiment, the invention addresses the problem with trying to identify and find segments of an object under erasure coding. Instead of using a separate database, a manifest file (or object) includes a description of each segment associated with a particular object. Included within the manifest are the unique identifier within the cluster for each segment, the size of each segment, which encoding algorithm is used and the specific erasure coding (5:7, etc.) for the object. The manifest is then treated as an ordinary object within the cluster, provided with a unique identifier, and replicated as required (e.g., replicated to the same degree of redundancy, P+1). Replication of the manifest is simple because the cluster is already performing replication for other objects and has little storage overhead because the manifest is relatively small in size. Thus, the segments of a particular erasure-coded object are quickly and easily accessible via the manifest. The identifier for this manifest object is returned to client applications for future retrieval of the object. This provides the efficient footprint of erasure coding, while preserving the simple identification, high availability, and fast startup of ordinary replication.

In a third embodiment, the invention addresses exposure to data loss within a storage cluster by minimizing the time after a hardware failure before all missing segments are regenerated. Each segment of an erasure-coded object includes a hint as to the volume identifier within the cluster that holds the next segment of the object. The hint is likely to be the correct volume identifier but is not guaranteed to be. As soon as a hardware failure occurs (such as a disk failure), and before it is noticed that a particular segment is missing, each volume within the cluster will scan its journal on disk in order to find a segment having as its hint a volume identifier for the failed volume. Thus, any missing segments can be identified and can be regenerated as quickly as possible, before waiting for any normal integrity checking of the cluster to occur.
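The journal scan described above can be sketched as follows. This is a minimal Python illustration, not the cluster's actual implementation. The `SegmentRecord` type and its field names are hypothetical, and the sketch assumes each journal record carries the ordered identifiers of all segments in its erasure set, as the segment header is described as doing later in this description. Because the hint is only likely to be correct, the result is a list of candidates rather than a confirmed list of lost segments.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SegmentRecord:
    """Hypothetical journal record for one erasure-coded segment."""
    segment_id: str
    siblings: Tuple[str, ...]  # ids of every segment in the set, in order
    volume_hint: str           # likely volume of the NEXT segment in the set


def candidates_missing(journal: Iterable[SegmentRecord],
                       failed_volume: str) -> List[str]:
    """Return ids of segments that probably lived on the failed volume."""
    missing = []
    for rec in journal:
        if rec.volume_hint != failed_volume:
            continue
        # The hint points at the next sibling, wrapping at the end of the set.
        i = rec.siblings.index(rec.segment_id)
        missing.append(rec.siblings[(i + 1) % len(rec.siblings)])
    return missing


ids = ("k1", "k2", "k3", "k4", "k5", "p1", "p2")
local_journal = [
    SegmentRecord("k4", ids, volume_hint="vol-E"),
    SegmentRecord("p2", ids, volume_hint="vol-A"),
]
print(candidates_missing(local_journal, "vol-E"))
```

Each surviving volume would run such a scan over its own journal, so a failed volume's segments are discovered through their upstream siblings without consulting a separate control database.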

Storage Cluster Example

As mentioned above, the present invention applies to digital objects, i.e., any type of information represented in digital form. For instance, a digital object may be an electronic representation of information such as a computer file, a group of files, a group of file identifiers, or other collections of data or database information. Such other collections of data include frames or clips from digital audio or video streams, digital photographs, scanned paper documents, voice messages, CAD/CAM designs, MRI or X-ray data, streams from message records or files, log entries from audits or status logs of systems, e-mail archives, check images, etc. The term “computer file” is often used herein to encompass any electronic representation of information.

The present invention may be implemented using any suitable computer hardware and software, and may be implemented upon a storage cluster that includes any number of computer nodes. Preferably, each node includes a CPU (or multiple CPUs), an operating system, communication links to the other nodes (or, to at least a central router), and any number of internal hard disk drives or solid-state drives (i.e., from zero to N), often referred to as volumes. Typically, each node includes at least one drive and there may be any combination of hard disk drives along with solid-state drives. A storage cluster is typically a fixed-content cluster, meaning that it is used for backups, long-term storage, archiving, etc., and is not typically used for day-to-day access to computer files. Often referred to as WORM (write once, read many) storage, this means that once a computer file or digital object is written to the cluster it cannot be changed. (Of course, the file may be deleted and a modified version of the computer file may also be stored within the cluster.) A cluster may be implemented as a redundant array of independent nodes (a RAIN) meaning that each node runs its own operating system and makes independent decisions about storage within the cluster. Storage clusters may be built upon blades, towers, personal computers and servers. Alternatively, a multi-core processor within a single computer box may support a virtual storage node running on each core, meaning that a storage cluster having multiple nodes may be present within the single computer box. Further, a computer system inside a single physical box may include multiple CPUs in which case each CPU may represent a node and the storage cluster may be implemented within the single physical box.

FIG. 1 illustrates an environment 100 for operation of the present invention. Included is a storage cluster 120, a client application 130, an administrative console 140, any number of computer nodes 10-50, and a central router 170. As mentioned above, a computer node is typically a physical file server that preferably includes at least one CPU and any number of disk drives 160, solid-state drives or hybrid drives that have both types. In one particular embodiment, the storage cluster 120 may be further logically or physically divided into subclusters. For example, nodes 40 and 50 may be considered one subcluster, while nodes 10, 20 and 30 may be considered a second subcluster. Division of a cluster into subclusters may be advantageous where one subcluster is located in a different geographic location from another subcluster.

Each node implements an operating system such as Debian Linux and executes processes to manage peer-to-peer communications between nodes, to perform health processing, and to make independent decisions on behalf of the node and its volumes. Each node also includes administrative software and its status can be viewed via a web browser over the Internet. In one particular RAIN embodiment, each node is a 1U server (e.g., an x86 computer) with 1 terabyte or more of serial ATA disk storage capacity with standard Ethernet networking. Each node has an IP address and may be physically interconnected using an IP-based LAN, MAN or WAN. Thus, each node may talk to a single node or may broadcast a message to all nodes within the storage cluster (a multicast) using a router 170 or other similar network switch.

Each node includes a management module for handling external requests from client applications (e.g., an SCSP request from client 130), replication requests between nodes (e.g., interSCSP requests), and other internode protocol communications (bidding, requests for information, etc.). A health processing module manages the digital content of each node. The administrative console 140 is preferably a Web server linked to the storage cluster that allows access to each node over any suitable Internet connection. Each node implements a redundant administrative console which can be used to view and administer the entire cluster. In certain embodiments, all nodes are considered equal and communicate with each other by periodically broadcasting (or “multicasting”) their relevant information to all other nodes within the cluster.

In one embodiment, a storage cluster may be implemented using content storage software available from Caringo, Inc. of Austin, Tex. (as modified as described herein), and any suitable computer hardware. In this embodiment, a storage cluster implements fixed-content content-addressable storage and each digital object is uniquely addressed within the cluster by a random number (a universally unique identifier, or UUID) that has been generated for that digital object using a random number generator. The contents of each digital object may be verified using a hash function. A client software application receives the UUID when storing a digital object in the cluster and retrieves that digital object by supplying the UUID to the cluster. Software applications communicate with a CAStor cluster using standard HTTP 1.1, and more specifically, using a simplified subset of the standard called Simple Content Storage Protocol (SCSP). Using this standard interface, client applications such as electronic mail, enterprise content management, health care applications, Web browsers, Web 2.0 sites, photo sharing, social media sites, security video, video editing, etc., may access the CAStor storage cluster in order to store files, retrieve files or delete files. Further, direct HTTP access is available for browsers, Java, Python, C++, and other software environments.

In one embodiment, a digital object is stored on a particular node in the following way. Each node includes a disk index in RAM listing where a digital stream including the object is stored on disk based upon its unique identifier. For example, a first column of the index lists the unique identifier of the object, a second column lists at which sector the stream starts and a third column lists either the length of the stream or the sector in which it ends. A stream may include the digital object as well as relevant metadata. Accordingly, objects to be stored on a node may simply be written sequentially to the disk and their locations recorded in the RAM index. Or, objects may be stored anywhere on the disk using any suitable storage algorithm and the objects' locations again will be recorded in the index. When an object is to be read or deleted, its location on the disk may be found by consulting this index. To facilitate building up this RAM index upon a restart of the node, a journal of the node stored within persistent storage records whenever an object is added or deleted and includes the unique identifier for the object, the sector at which it starts, its length in sectors or bytes, and other information as described below. Accordingly, when the node is restarted the information in the journal is read and is used to create the disk index in RAM. Another technique to build up the index instead of using a journal is to read the entire disk upon a restart in order to gather the necessary information although this will be more time consuming.
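The relationship between the on-disk journal and the RAM index can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the `JournalEntry` type, its field names and the `"add"`/`"delete"` operation labels are invented for this example, and a real journal would carry the additional information mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class JournalEntry:
    """Hypothetical journal record: one add or delete of a stream."""
    op: str                # "add" or "delete"
    uuid: str              # unique identifier of the object
    start_sector: int = 0  # sector at which the stream starts
    length: int = 0        # length of the stream, in sectors


def rebuild_index(journal: Iterable[JournalEntry]) -> Dict[str, Tuple[int, int]]:
    """Replay the journal in order to recreate the RAM disk index."""
    index: Dict[str, Tuple[int, int]] = {}
    for entry in journal:
        if entry.op == "add":
            index[entry.uuid] = (entry.start_sector, entry.length)
        elif entry.op == "delete":
            index.pop(entry.uuid, None)
        else:
            raise ValueError(f"unknown journal operation: {entry.op!r}")
    return index


journal = [
    JournalEntry("add", "obj-a", start_sector=0, length=8),
    JournalEntry("add", "obj-b", start_sector=8, length=4),
    JournalEntry("delete", "obj-a"),
]
index = rebuild_index(journal)
print(index)
```

Replaying a compact journal in this way avoids reading the whole disk on restart, which is the slower alternative noted above.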

Objects may be stored, and duplicates deleted, as described in “Two Level Addressing in Storage Clusters” and “Elimination of Duplicates in Storage Clusters” referenced above.

Erasure Set Example

FIG. 2 illustrates an example erasure set 200 for an object using 5:7 erasure coding. As shown, the data of the original object is separated into five data segments (k1-k5) 210-218, from which are generated two parity segments (p1 and p2) 220 and 222. In one embodiment, the data is written, and the parity is generated, in stripes (st1-st9) 231-239. For example, the first stripe 231 consists of original data 251-255 from which are generated parity data 256 and 257. Any remaining data is formed in a final remainder stripe (rem) 240 and hash metadata may be stored at the end of each segment in a section 270.

As mentioned earlier, an object (or stream) to be stored within the cluster may be broken into several erasure sets of a given size, the size being chosen for performance sake. A very large object may be broken into several erasure sets, for example. Within an erasure set the K data segments and the P parity segments are written using stripes, with a fixed-size block of data written successively into each of the K data segments and the parity blocks then generated and written to each of the P parity segments, each stripe (across the K+P segments) serving as an erasure coding unit. The last stripe in a segment (e.g., remainder stripe 240) may have a smaller block size that can be readily computed. Stripes are typically written until the incoming data is exhausted or until the given size for the erasure set has been filled, with subsequent data starting a new erasure set.

In a specific embodiment, data from a single write operation is written in stripes across all K segments of the erasure set in fixed-size blocks (e.g., 32 kilobytes). In other words, blocks 1, K+1, 2K+1, etc., are written to the first segment 210, blocks 2, K+2, 2K+2, etc., are written to the second segment 212, etc. The P parity blocks are generated synchronously as part of the generation of each stripe of data and are incrementally written at the end of each stripe. Striping allows parity segments to be generated synchronously with predictable memory overhead and also allows the writing of chunked-encoded streams, which usually do not provide the content length at the beginning of the write operation. Of course, the original data object need not be divided up into stripes in order to perform the erasure coding, but this provides some efficiency as the parity blocks can be generated for one stripe before moving on to the next stripe. Alternatively, if striping is not used, the first data segment 210 would include the first blocks of data within the data object, the second data segment 212 would include the next set of blocks of data, and so forth. At the end of each segment in the set is metadata 270 for each segment which includes an MD5 hash value for that segment. For example, each hash value is 16 bytes and written without separators.
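The block layout described above (blocks 1, K+1, 2K+1 to the first segment, and so on) can be sketched in Python as follows. This is an illustration of the data layout only. Parity generation is deliberately omitted, the function names are assumptions made for this example, and the final short block is simply appended to whichever segment is next in rotation, which is a simplification of the remainder stripe.

```python
import hashlib
from typing import List


def stripe(data: bytes, k: int, block_size: int) -> List[bytes]:
    """Distribute fixed-size blocks round-robin over K data segments."""
    segments = [bytearray() for _ in range(k)]
    for n, offset in enumerate(range(0, len(data), block_size)):
        # Block n (0-based) goes to segment n mod K.
        segments[n % k] += data[offset:offset + block_size]
    return [bytes(s) for s in segments]


def unstripe(segments: List[bytes], block_size: int) -> bytes:
    """Reassemble the original data by reading one stripe at a time."""
    longest = max((len(s) for s in segments), default=0)
    rounds = (longest + block_size - 1) // block_size
    out = bytearray()
    for r in range(rounds):
        for s in segments:
            out += s[r * block_size:(r + 1) * block_size]
    return bytes(out)


def segment_hash(segment: bytes) -> bytes:
    """16-byte MD5 digest of a segment, as kept in the trailing metadata."""
    return hashlib.md5(segment).digest()


data = bytes(range(11))
segs = stripe(data, k=3, block_size=2)
print([s.hex() for s in segs], len(segment_hash(segs[0])))
```

Because each stripe is complete before the next begins, a writer following this layout only needs to buffer one stripe at a time, which is what makes the memory overhead predictable even when the total length is not known in advance.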

Each segment of an erasure set is an immutable stream, each having its own unique identifier, just as with any other stream of the cluster. Each erasure segment also contains an additional header 280 containing metadata concerning the other segments in the erasure set, such as the unique identifier of its manifest, and the unique identifiers of all of the data and parity segments, in order. Each segment's own unique identifier will identify its place in its set of siblings. The header for each segment will also include system metadata such as the volume identifier where the segment is located, and the likely volume identifier for the next segment of the erasure set (the volume hint). For example, segment k4 includes the volume identifier where the segment k5 is located, segment p2 includes the volume identifier where the segment k1 is located, etc. Preferably, each erasure set being part of a larger stream (representing a single object) will have the same erasure coding, e.g., 5:7 for all erasure sets.
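The volume hints form a ring over the ordered segments of an erasure set: each segment points at the volume of its successor, and the last parity segment points back at the first data segment. A minimal Python sketch follows; the function name and the volume labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def volume_hints(segment_ids: Sequence[str],
                 volume_of: Dict[str, str]) -> Dict[str, str]:
    """Map each segment id to the volume holding the next segment in order."""
    n = len(segment_ids)
    return {
        seg: volume_of[segment_ids[(i + 1) % n]]  # wrap from last to first
        for i, seg in enumerate(segment_ids)
    }


order = ["k1", "k2", "k3", "k4", "k5", "p1", "p2"]
placement = {seg: f"vol-{i + 1}" for i, seg in enumerate(order)}
hints = volume_hints(order, placement)
print(hints["k4"], hints["p2"])
```

With this arrangement every segment is referenced by exactly one upstream sibling, so the loss of any single volume is visible from a hint stored on some other volume.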

During a write operation, each of the seven segments shown may be written in parallel using seven different nodes of the cluster, thus providing for data protection should a node or volume fail. Once a segment is written to a volume, it may be treated like any other stream within the cluster (aside from the volume and its contents, its recovery process, and the fact that it need not be replicated).

Manifest Example

FIG. 3 presents an example manifest 300 that may be used in embodiments of the invention. As mentioned earlier, once erasure coding has been used to encode a data object into any number of segments (depending upon the encoding used), the unique identifiers for those segments may be stored within a manifest (it itself being an object stored within the cluster and having its own unique identifier) for later retrieval.

Included within each manifest is a metadata section 310, at least one erasure set 340 and any number of other erasure sets 380. Of course, the number of erasure sets represented within the manifest depends upon the size of the data object, the size of each segment, and the encoding scheme used. Section 312 of the manifest provides the name of the erasure coding algorithm used and its version number. The specific erasure coding is shown in section 314 and the size of each data block (stripe width) within a segment is shown in section 316. Other information in this metadata section includes the time the object was written, its unique identifier, its ownership, and metadata specified by the application.

The representation of the first erasure set (or the only erasure set) includes its size 342, the erasure coding used for this erasure set 344, the segment width 346 (in bytes), and the total segment size 348 (in bytes). Information such as the encoding and segment width may also be present within each erasure set as well as in metadata section 310 in order to provide flexibility, or such data may only be present in one area.

The first erasure set also includes the unique identifiers for each of the segments in that erasure set. For example, section 350 shows unique identifiers for the segments in the first erasure set. As mentioned, the manifest may contain more than one erasure set. In that situation, a representation of a second erasure set 380 may be included as well as any number of other representations of other erasure sets. The second erasure set shows the size (that may be different from the first set), an encoding, and similar information as the first erasure set. The second erasure set may use the same encoding as the first set, as well as segment width and segment size, but this is not a general requirement.

Any additional erasure sets will include the unique identifiers for all segments included within that erasure set. Preferably, manifests themselves are replicated and not erasure coded. This will not hurt space efficiency because a manifest only uses a small fraction of the space of the data object for which it is a manifest. In addition, in order to offer the same protection against simultaneous disk failures as the erasure encoding of the erasure sets, the manifest for a data object encoded using K:M (P=M−K) encoding should have a total number of replications of at least P+1.
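The manifest contents described above can be modelled as a small data structure. The following Python sketch is one possible in-memory representation and is not the manifest format of FIG. 3. The class names, field names, the codec name `"example-codec"` and the sample sizes are all assumptions made for illustration. The `min_replicas` method applies the P+1 rule, taking the largest P across erasure sets in case they differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErasureSetEntry:
    """Per-erasure-set portion of a manifest (hypothetical field names)."""
    size: int           # bytes of object data covered by this set
    k: int              # data segments
    m: int              # total segments (data + parity)
    segment_width: int  # block size within a segment, in bytes
    segment_size: int   # total size of each segment, in bytes
    segment_ids: List[str] = field(default_factory=list)


@dataclass
class Manifest:
    """Metadata section plus one or more erasure sets."""
    algorithm: str
    version: str
    erasure_sets: List[ErasureSetEntry] = field(default_factory=list)

    def min_replicas(self) -> int:
        """Replicas needed to match the erasure sets' protection (P + 1)."""
        return max(es.m - es.k for es in self.erasure_sets) + 1

    def all_segment_ids(self) -> List[str]:
        """Every segment identifier, in erasure-set order."""
        return [sid for es in self.erasure_sets for sid in es.segment_ids]


manifest = Manifest(
    algorithm="example-codec",  # placeholder name, not a real codec
    version="1",
    erasure_sets=[
        ErasureSetEntry(size=5 * 65536, k=5, m=7, segment_width=32768,
                        segment_size=65536,
                        segment_ids=[f"seg-{i}" for i in range(7)]),
    ],
)
print(manifest.min_replicas(), len(manifest.all_segment_ids()))
```

For the 5:7 example, P is 2, so three replicas of the manifest survive the same two simultaneous disk failures that the erasure set itself can survive.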

Write Digital Object to Cluster

FIGS. 4A and 4B are a flow diagram describing how a client application writes a file (a digital object) to the storage cluster. In step 404 any suitable client application 130 desires to store a digital object (such as any computer file, digital image, digital movie, health record, etc.) in the storage cluster 120. The client application discovers or obtains an IP address of one of the nodes 10-50 within the cluster and chooses that node as the primary access node (PAN) with which to begin the storage process. Typically, a client application will access the last node used for its next request to write. The client application sends a request to the PAN to store the digital object. In one embodiment, this request is an HTTP POST request that includes a header, length in bytes of the digital object, and the object itself. The response to this request from the PAN to the client will be one of: yes, the PAN can facilitate storing the object; no, here is a better node to store the object; or no, no node within this cluster can store that object.

Assuming that the PAN will facilitate storing the object, the client has passed the digital object to be stored at this time, although it is possible to wait until a secondary access node (SAN) is selected for storage to pass the object. If the object is not passed initially, the client may pass object metadata such as the size of the object, whether long-term or short-term storage is desirable, and whether the object will be accessed frequently in the future (all of which can aid in the selection of a storage node), and optionally a file name for the object. Some clients require that a hierarchical or arbitrary file name be used for objects that the client stores within the cluster, and in these situations a hash value may be derived from such a file name and used as the unique identifier.

More relevant to this invention, though, is object metadata that may be used to determine whether to store the object using replication or erasure coding. With the write request, or in the object metadata, the client application may specify that this object should be stored using replication or erasure coding (and may also specify a particular erasure coding). In fact, the client may specify that all objects to be written be stored using replication or erasure coding. In the absence of a specific instruction from the client, the invention may use any of a variety of information in order to decide whether to select replication or erasure coding for an object. For instance, the size of the object may be used (objects above a certain size being stored using erasure coding), as may the type of object (image file, text information, health record, etc.), object ownership, and expected lifespan. Additional metadata used to determine when and whether to convert an object after a write is discussed below.

In step 408, once the secondary access node (SAN) has been selected, the SAN decides whether to use replication or erasure coding in order to store the current object. As mentioned above, the SAN may use instructions from a client application or may use any suitable rule based upon the object metadata in order to make this decision. In one particular embodiment, the size of the digital object is used as the criterion: specifically, an object smaller than 10 megabytes will be replicated, while any larger object will be stored using erasure coding. In general, information that may be used to determine whether to use replication or erasure coding includes: any rules or settings of the storage cluster set by the cluster administrator; any inherent property of the object itself or of its metadata; and the nature of the query or any instructions from the client application. Should replication be chosen, in step 412 the SAN requests and receives any number of write bids from nodes within the cluster for the writing of the digital object.
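The size-based decision of step 408 may be sketched as follows, purely as an example. The function and constant names are hypothetical, "10 megabytes" is assumed to mean 10 × 1024 × 1024 bytes, and an object of exactly the threshold size is assumed to be erasure coded; the description fixes none of these details.

```python
from typing import Optional

# Assumed reading of "10 megabytes"; the description does not give a byte count.
SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_scheme(size_bytes: int, client_hint: Optional[str] = None) -> str:
    """Return "replication" or "erasure" for an object about to be written."""
    # An explicit instruction from the client application takes precedence.
    if client_hint in ("replication", "erasure"):
        return client_hint
    # Otherwise fall back to the size rule: small objects are replicated.
    if size_bytes < SIZE_THRESHOLD_BYTES:
        return "replication"
    return "erasure"
```

Other criteria named above (object type, ownership, expected lifespan, cluster rules) could be added as further checks ahead of the size rule.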

If the SAN has the lowest bid it responds by sending a “continue” message back to the client application. In response, the client sends the digital object to the SAN and the SAN stores the digital object, calculates a unique identifier and returns this identifier to the client application. On the other hand, if the SAN loses the bid, then the SAN redirects the client application to the node with the lowest bid that will then handle the request. The client application then sends the same write request to this node and the node responds by sending a “continue” message back to the client application. In response, the client sends the digital object to the node and the node stores the digital object, calculates a unique identifier and returns this identifier to the client application. Calculation of write bids may be performed as described in application Ser. No. 12/839,071 entitled “Adaptive Power Conservation” which is hereby incorporated by reference. Alternatively, a digital object may be written as described above in the application entitled “Two Level Addressing in Storage Clusters.” After the object has been written, it may be replicated to different nodes immediately as many times as required, or the cluster may wait for periodic integrity checking in order to replicate the object.

On the other hand, if erasure coding is chosen for storage of the object within the cluster, then in step 416 the SAN issues a request for write bids from all of the nodes within the storage cluster. Once the SAN has determined the specific erasure coding to be used (K:M, based upon a client instruction, object metadata, or a cluster system setting or constant), it will then select M nodes to be used for storing the data and parity segments of the digital object. Preferably, the SAN chooses nodes bidding with the lowest cost, although other techniques may also be used such as the best performance, the lowest power consumption, or other criteria.
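A minimal sketch of the lowest-cost selection in step 416 is given below. It assumes, hypothetically, that each write bid is a single numeric cost keyed by a node name; how bids are actually calculated is outside this sketch.

```python
import heapq
from typing import Dict, List


def select_nodes(bids: Dict[str, float], m: int) -> List[str]:
    """Pick the M nodes with the lowest write bids for one erasure set."""
    if len(bids) < m:
        raise ValueError("fewer bidding nodes than segments to place")
    # nsmallest returns the node names ordered from lowest to highest bid.
    return heapq.nsmallest(m, bids, key=bids.get)
```

A different ranking (best performance, lowest power consumption) could be obtained by substituting another key function.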

In one embodiment, in order to reduce risk, nodes may be chosen within different sub-clusters that are physically separated, in order to ensure that if one sub-cluster is lost, entire objects may be regenerated from segments within the remaining sub-clusters. For example, given three sub-clusters, an object encoded with 4:6 encoding will have its segments distributed so that two are stored in each of the three sub-clusters. The loss of any one sub-cluster leaves four segments, adequate to reconstruct the object. Similar schemes are possible for different numbers of sub-clusters.
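The arithmetic of the sub-cluster example may be checked with the following sketch. The round-robin placement and both function names are assumptions made for illustration; the description only requires that enough segments survive the loss of one sub-cluster.

```python
from typing import Dict, List


def place_segments(m: int, subclusters: List[str]) -> Dict[str, List[int]]:
    """Distribute segment indexes 0..M-1 round-robin over the sub-clusters."""
    placement: Dict[str, List[int]] = {name: [] for name in subclusters}
    for seg in range(m):
        placement[subclusters[seg % len(subclusters)]].append(seg)
    return placement


def survives_any_single_loss(placement: Dict[str, List[int]], k: int) -> bool:
    """True if at least K segments remain after losing any one sub-cluster."""
    total = sum(len(segs) for segs in placement.values())
    return all(total - len(segs) >= k for segs in placement.values())
```

With a 4:6 encoding over three sub-clusters each sub-cluster holds two segments, so any single loss leaves the four segments needed. A 5:7 encoding over the same three sub-clusters would not pass this check, since one sub-cluster would hold three of the seven segments.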

Next, in step 420 the first node to store the first data segment of the object will be designated and this node will perform a number of steps in order to prepare for erasure coding of the object into the data and parity segments on the different nodes. For instance, the SAN will choose unique identifiers for all of the data and parity segments in the erasure set (e.g., using a random number generator), will set up chunked encoded POSTs to the M node recipients, and will determine a maximum segment size at the beginning of the write which will limit the size of this erasure set. The maximum segment size may be determined by reference to a storage cluster setting. Each node returns volume information for its segment in a response to the SAN.

Next, in step 424 the client application begins transferring the data object to the SAN which will write the data to each data segment on the different nodes in a stripe and will compute the data for the parity segments using a suitable erasure coding algorithm. In one embodiment, the Zfec algorithm has been found to work well. For example, referring to FIG. 2 and assuming a 5:7 encoding, the first 32 k block of the object received will be written to data segment 210 on the first node (the SAN), the second 32 k block will be written to data segment 212 on the second node, etc. After the fifth data block has been written to the fifth node, the two parity blocks will be computed and stored in segments 220 and 222 in the last two selected nodes of the cluster. The digital object continues to be read from the client application and written to the M selected nodes in the storage cluster stripe-by-stripe until either the end of the object is reached or the end of the segment is reached, test 428. A remainder stripe 240 may be written as shown if there is not a full data block to be written. Once the end of the segment is reached (but remaining data in the object still needs to be written) or the entire object has been written to the M segments, then control moves to step 432 of FIG. 4B.
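The block-by-block striping of the data segments may be pictured with the sketch below. It is a simplification under stated assumptions: the function name is hypothetical, "32 k" is taken to mean 32 × 1024 bytes, parity computation and the maximum segment size are omitted, and a short final block is simply appended to the next segment in turn rather than handled as the remainder stripe of FIG. 2.

```python
from typing import List


def stripe_blocks(data: bytes, k: int, block_size: int = 32 * 1024) -> List[bytearray]:
    """Deal consecutive blocks of an object across K data segments."""
    segments = [bytearray() for _ in range(k)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        # Block 0 goes to segment 0, block 1 to segment 1, ..., wrapping after K.
        segments[block_index % k] += data[offset:offset + block_size]
    return segments
```

In a full implementation, each completed stripe of K data blocks would additionally be passed to the erasure coding algorithm to produce the M−K parity blocks for that stripe.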

Next, in step 432 the SAN optionally calculates a hash value for each segment based upon the data of each segment and sends these values 270 (for example) as trailing data to all of the selected nodes that are writing segments to their volumes. Each node will compare this received hash value from the SAN with one it computes on disk for the segment it has stored on one of its volumes.

Next, in step 436 erasure coding volume hints will be determined and stored for each segment. For example, the volume identifier where segment 216 is stored will be written into the system metadata for segment 214, thus ensuring that in a ring-like fashion, each segment stores the likely volume identifier for the next segment. Preferably, volume hints are transmitted from the SAN when each segment is written using a POST request. In addition, volume hints may also be written into the journal of each volume. In other words, once an erasure set has been written to any of a number of nodes, the journal entry for a particular volume where a given segment has been stored will be updated to include the volume identifier where the next segment of the object is stored.
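The ring-like arrangement of volume hints may be illustrated as follows. The function name and the representation of volumes as strings are assumptions for the example; the sketch only shows that segment i carries the volume identifier of segment i+1, with the last segment pointing back to the first.

```python
from typing import List


def ring_volume_hints(segment_volumes: List[str]) -> List[str]:
    """Given the volume of each segment in order, return each segment's hint."""
    n = len(segment_volumes)
    # The hint of segment i is the volume of segment (i + 1), wrapping around.
    return [segment_volumes[(i + 1) % n] for i in range(n)]
```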

Step 440 determines whether there is more data from the digital object to write into the storage cluster. If so, then in step 444 a new erasure set is begun. In this situation, the SAN will request write bids from the entire cluster in order to determine the next M nodes that will accept the writing of data and parity segments to their volumes. Control then moves to step 416 in order to write the next erasure set for this digital object.

If there is no more data to write, then in step 448 the manifest for this erasure set is written. As shown in FIG. 3, the manifest will include a metadata section and a section for each of the erasure sets. A unique identifier (hash value, random number, etc.) is calculated for each segment of each erasure set and stored in section 350, for example. This unique identifier may be calculated in step 416 or later (in the case of a random number), or in step 428 or later (in the case of a hash value). The manifest will be written to any node of the cluster and treated like any stream written to the storage cluster. In other words, a unique identifier will be calculated for the manifest and the manifest will be replicated to different nodes within the cluster. Preferably, the manifest will be replicated a total of P+1 times within the cluster. Finally, the unique identifier for the manifest is returned to the client application so that the client application may access its digital object in the future when performing a read operation.

Read Digital Object from Cluster

FIG. 5 is a flow diagram describing how a client application reads a digital object from the storage cluster. Advantageously, the client application need not be aware of which technique (replication or erasure coding) the storage cluster is using to store the digital object. By simply using a unique identifier for the digital object (previously supplied by the storage cluster), the client application may retrieve the object from the cluster without needing to rely upon any outside database or control system. In fact, the client application will be unaware if the unique identifier represents the object which has been replicated within the cluster, or represents a manifest indicating that erasure coding has been used to store the object in the cluster.

In step 504 the client application may make a request of the storage cluster to return a particular digital object identified by a unique identifier. This may be implemented using an SCSP GET request or a similar HTTP protocol. The client application supplies the identifier to any node of the storage cluster (which becomes the primary access node). Next, in step 508 the PAN broadcasts a message to all nodes within the cluster looking for an object having that particular unique identifier. At this point, it is also transparent to the PAN whether or not the unique identifier represents the actual object or a manifest.

Because both actual objects and manifests will be replicated within the storage cluster, the PAN will receive a number of responses to its broadcast request. In one embodiment, each node having a copy of the object (or of the manifest) calculates a read bid (the cost to read the digital object); the PAN chooses the node with the lowest read bid, redirects the client application to that node, and that node then becomes the secondary access node (SAN). The SAN will be aware of whether replication or erasure coding has been used by looking at the system metadata of the found object (either the actual object or the manifest of the object).

Accordingly, in step 510 the SAN determines whether it is holding the actual digital object (because replication had been used) or whether it is holding the manifest for the actual digital object (because erasure coding had been used). If replication had been used, then in step 512 the SAN may simply return the digital object to the requesting client application and the method ends. Alternatively, if erasure coding had been used and the SAN is holding the manifest, then in step 516 the SAN begins the process of requesting the segments from the cluster in order to reassemble the requested digital object. Using the metadata of the manifest, the SAN is aware of the erasure coding algorithm used and the specific erasure coding (e.g., 5:7). Preferably, since only the first K data segments are necessary to reassemble the digital object, the SAN broadcasts a request for only the first K data segments using the unique identifiers found in section 350 of the manifest. If successful, then control moves to step 528.

If, however, any of these first K data segments are missing (step 520), then a request will be broadcast for any needed parity segments. For example, if two of the original data segments are missing, then a request must be broadcast for two of the parity segments using the unique identifiers from the manifest. If the needed number of parity segments are found, then in step 524 the missing data segment (or segments) is regenerated using the appropriate erasure coding algorithm and the found parity segments. In one embodiment, the hash value of the missing segment may be computed and compared to the original. Or, it is also possible to verify the data for each stripe by using a generated block as input with blocks from segments 1 to (K−1) to generate block K and compare that block against the original. If, though, K segments cannot be found, then an error message is returned to the client application.
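The segment selection of steps 516 through 524 may be sketched as below. The function name and the use of a set of locatable identifiers are assumptions for illustration; in the cluster the availability of a segment would be learned from the responses to the broadcast requests, and regeneration itself is left to the erasure coding algorithm.

```python
from typing import List, Set


def plan_read(data_ids: List[str], parity_ids: List[str], available: Set[str]) -> List[str]:
    """Choose K segment identifiers to fetch for one erasure set.

    Data segments are preferred; parity segments are requested only to make
    up for missing data segments.
    """
    found = [seg for seg in data_ids if seg in available]
    needed = len(data_ids) - len(found)
    if needed == 0:
        return found  # all K data segments located, no regeneration required
    parity_found = [seg for seg in parity_ids if seg in available][:needed]
    if len(parity_found) < needed:
        # Corresponds to returning an error message to the client application.
        raise LookupError("fewer than K segments could be located")
    return found + parity_found
```

For a 5:7 erasure set with two data segments missing, both parity segments are requested; if one of those is also unavailable, the sketch raises an error.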

Assuming that K segments are either found or generated, then in step 528 the SAN determines (using the manifest) whether there is another erasure set to be obtained. If not, then control moves to step 532. If so, then control moves to step 516 and the SAN begins the process of requesting the necessary segments of the second erasure set using the unique identifiers found in the corresponding section for the second erasure set of the manifest. In steps 532 and 536 the SAN marshals the data and/or parity segments for each erasure set it has retrieved in order to reconstruct the original data object. For example, when the necessary K segments are found on nodes within the cluster, the SAN reads the data from these segments into its memory and applies the appropriate erasure coding algorithm in order to reconstruct the original digital object on the SAN. Preferably, as each stripe of the object is retrieved or reconstructed, this data is fed byte-by-byte via HTTP back to the client application. Should there be more than one erasure set, the SAN reconstructs the next portion of the digital object using the next erasure set, and feeds back the bytes of the next erasure set to the client application. Alternatively, the SAN may assemble the entire object within its own memory before sending the object back to the client.

Recovery from Volume Failure

FIG. 6 is a flow diagram describing how a storage cluster may recover from a volume failure. As mentioned above, a storage cluster includes any number of computer nodes, each node having any number of hard disks or solid-state disks, referred to as volumes. A storage cluster typically achieves data redundancy by storing different replicas of an object on different nodes (when replication is used), and by storing the various data and parity segments of an object on different nodes (when erasure coding is used). Consequently, if a disk of a node fails, many of the replicas and segments of any number of objects will be lost, thus degrading the purported data redundancy of the entire storage cluster. Further, the quality of a storage cluster is judged not only by how many volumes it can afford to lose, but also by how fast the cluster can recover the missing data when a volume fails. Accordingly, FIG. 6 presents a technique that not only handles failed volume recovery when there exists a mix of objects stored using replication (replicas) and erasure coding (segments), but also recovers missing segments as quickly as possible. Should an entire node fail, then the below technique is performed for each volume on the node that has failed.

In step 604 a node of the cluster notices that one of its disks has failed. In one instance, once a node has communicated with a volume in the normal course of business, the node will expect to hear from that volume periodically. If not, the node will broadcast a message to all of its volumes looking for the missing volume. If there is no response, then the node assumes that the volume has failed. In another instance, the storage cluster may be entirely shut down due to maintenance, moving of the cluster, etc. When the cluster is brought back up, it is possible that a volume will fail but the node may not notice since it has not heard from the volume before. In this situation, the health processing module of the node can assist. The health processing module periodically checks the integrity of all streams on each volume; when it detects a volume hint (the unique identifier of the volume) of a particular stream it will look for that volume. If not found, the node will again assume that the volume has failed. Other techniques for detecting that a volume has failed may be used.

Next, in step 608, once the unique identifier of the volume that has failed has been obtained, each node within the cluster is directed to scan all of its functioning volumes in order to identify streams that include a volume hint for the missing volume. Preferably, the node that has identified its missing volume sends a broadcast message (including the volume identifier) to all other nodes requesting a search for streams that have a hint for the missing volume. Also, the nodes will perform the search in parallel for efficiency. In one embodiment, the journal that each volume has recorded on disk is scanned and each stream representation is analyzed to determine the volume hint that it contains. Because each stream representation in the journal that represents a segment of an erasure-coded object includes the volume identifier for the next segment, any such identified stream that includes the volume identifier for the missing volume will also indicate a segment that was on the missing volume. For example, if the stream representation of segment 222 of FIG. 2 includes a volume hint that is the volume identifier for the missing volume, this means that segment 210 was on that volume and needs to be regenerated. In addition, the representation of a replicated stream in the journal will include volume hints indicating the volume identifiers for all of the replicas of that stream. For each stream, the journal typically includes type information indicating whether the stream represents a replicated object or an erasure-coded object.

In another embodiment, these volume hints may be stored in the system metadata 280 of a segment (or the metadata of a replicated stream). For example, the system metadata for segment 216 includes a volume hint indicating the volume identifier where segment 218 is stored. It may be possible for each node to scan its volumes looking for the system metadata of each stream on disk, although this technique will be slower. The volume hint may then be read from this system metadata. Again, a volume hint in a particular segment indicating the failed volume indicates that the next segment is missing. When the node identifies that a segment is missing, it can determine the unique identifier for that missing segment by looking at the metadata 280 of the previous segment and retrieving the unique identifiers for all sibling segments. In order to find any segments necessary for regenerating a missing segment, these unique identifiers of the sibling segments may be used.
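The scan of step 608 for erasure-coded segments may be sketched as follows. The record layout is hypothetical: for simplicity it places a segment's own identifier, its volume hint for the next segment, and the ordered sibling identifiers in one record, whereas the description keeps the hint in the journal or in metadata 280 and the sibling list in metadata 280.

```python
from typing import List, NamedTuple, Tuple


class SegmentRecord(NamedTuple):
    """Hypothetical view of what a node knows about one surviving segment."""
    segment_id: str
    next_volume_hint: str         # volume believed to hold the next segment
    sibling_ids: Tuple[str, ...]  # ordered identifiers of the whole erasure set


def find_missing_segments(records: List[SegmentRecord], failed_volume: str) -> List[str]:
    """Return identifiers of segments that were stored on the failed volume."""
    missing = []
    for rec in records:
        if rec.next_volume_hint == failed_volume:
            position = rec.sibling_ids.index(rec.segment_id)
            # The hint points at the next segment in the ring, wrapping at the end.
            missing.append(rec.sibling_ids[(position + 1) % len(rec.sibling_ids)])
    return missing
```

The remaining entries of `sibling_ids` are the identifiers a node would then use to locate the K segments needed for regeneration.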

Once each node has finished scanning its journals (or its streams on disk), each node will have a list of the missing streams that had been on the failed volume. Because the storage cluster includes objects stored using replication and erasure coding, some of these missing streams represent replicated objects, while some of the streams represent missing segments of an erasure-coded object.

For missing replicated streams (if any), in step 612 each node replicates a missing stream by requesting bids from the other nodes to replicate the stream and then transfers control to the chosen node. Assuming that at least one segment of an erasure-coded object is missing from a given node, in step 616 (using the unique identifiers obtained in step 608) the given node requests the other nodes to bid for which nodes can supply the K segments necessary to regenerate the missing segment. Once the winning bids are received and the nodes are identified that can supply the K segments, the given node requests bids for a node to regenerate and store the missing segment.

Accordingly, in step 620 the given node requests bids from nodes within the cluster for a particular node to regenerate the missing segment. Once this node is chosen, the node regenerates the missing segment using the K segments identified in step 616. This regeneration may be performed using the appropriate erasure coding algorithm. In one embodiment, regeneration of the missing segment consumes data in stripes from the K segments while the destination stripe is computed and written in order to regenerate the missing segment.

In step 624 the node stores the segment on one of its volumes. If there are any more missing segments identified by the given node, then control moves to step 616 and the node again requests bids for the missing segment as previously described. Since each node within the cluster will be scanning its volumes in step 608 looking for missing streams, each node will also be performing steps 612 through 628 in parallel, assuming that each node has identified missing streams.

Relocation of Erasure Coded Segments

Once a segment has been written to the storage cluster and provided with a unique identifier, it may be managed as any other stream in the cluster, including replicated streams. In other words, the health processing module may see fit to move a segment from one volume to another, or from one node to another, independent of the other segments in an erasure set and without any loss of availability of the segment being moved. For example, should segment 218 be moved to a different volume, the volume hint in segment 216 will be updated to indicate the new volume for segment 218. When segment 218 is relocated the system is aware of all of its sibling segments (and their unique identifiers) because the metadata 280 of segment 218 includes the unique identifiers of all sibling segments, in order. The upstream segment, segment 216, may be retrieved from within the cluster using its unique identifier, and, once the new volume identifier for segment 218 is known, this new volume identifier may be written into the metadata section 280 of segment 216 as a volume hint for the new location of segment 218. Alternatively, the stream representation of segment 216 in its journal may be updated to include the new volume identifier.

This updating of the volume hint for a relocated segment may be performed as relocation occurs or may be performed at a later time by the health processing module. The advantage of this updating is that segments may be relocated within the cluster with no loss of availability and without the need for any extra control computer or control database to track segments within the cluster.

Conversion of an Erasure-Coded Object to Replication and Vice-Versa

One embodiment of the present invention is able to convert a digital object stored within the cluster from one format to another. For example, an object stored using an erasure coding of 5:7 may be converted to a coding of 6:10, an erasure-coded object may be converted to storage using replication, and an object stored using replication may be converted to storage using erasure coding. Whether to convert an object to a different format and when to do that may be dictated by object metadata, storage cluster default settings, or a combination of both.

As mentioned above, the user metadata provided with an object from a client application can provide information as to how an object should be stored, and may also dictate when the object should be converted to one format or another, as described immediately above. For example, the user metadata may dictate that within a specific time frame, or at a particular future time, the object should be converted to a different format. Or, storage cluster settings and rules may also dictate that objects shall be converted at a particular time or times, that objects of a certain size shall be converted periodically or at a particular time, or that a certain percentage of objects shall be converted. The cluster may even accept manual inputs from an administrator that change cluster settings or that dictate how and when conversion should happen for an object or objects within the cluster. A special conversion module may be used for performing conversion, or such functionality may be incorporated into the health processing module of the cluster.

FIG. 7 is a flow diagram describing one embodiment in which an object in a storage cluster may be converted from one storage format to another. Although this diagram deals with a single object, any number of objects within the cluster may be converted using this technique. The unique identifier for the object within the cluster remains the same so that the client application may retrieve the object using the same unique identifier it was originally provided with. For an object having a random number as its unique identifier, this random number remains the identifier for the object after conversion. In the case where a client application supplies a unique name for an object, the cluster may use a hash value of the name as the unique identifier, and this hash value remains the identifier after conversion.

The concept of versions of objects facilitates retaining the original unique identifier for the new, converted object. Each object includes a timestamp indicating when it was created; when an object is converted, it will have the same unique identifier as the original object, and the new object is given a timestamp later than the original. In this fashion, both unique identifiers may exist within the cluster at the same time, yet the cluster is aware of which object is the current, valid object by referencing the timestamps. An object having an earlier timestamp than its twin may be deleted at any time as it is not needed.
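The timestamp rule may be sketched as follows, as an example only. The dictionary layout with "timestamp" and "format" keys and both function names are hypothetical; the description only states that versions share a unique identifier and that the latest timestamp marks the current object.

```python
from typing import Dict, List


def current_version(versions: List[Dict]) -> Dict:
    """Among versions sharing one unique identifier, the latest timestamp wins."""
    return max(versions, key=lambda v: v["timestamp"])


def deletable_versions(versions: List[Dict]) -> List[Dict]:
    """Versions with an earlier timestamp than the current one may be deleted."""
    newest = current_version(versions)["timestamp"]
    return [v for v in versions if v["timestamp"] < newest]
```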

In step 704 relevant conversion information for an object (or for any number of objects) is stored within the storage cluster. As mentioned above, each object may be received from a client application with user metadata dictating how an object should be converted, when an object should be converted, etc. This user metadata is stored along with each object when the object is written to the cluster. This metadata is stored in the manifest of erasure-coded objects.

Additionally, storage cluster settings and rules may dictate default conversion settings for objects within the cluster. These settings and rules may be stored in a designated object of the cluster, or may be included in the information provided to the node during its startup, or may be provided by a designated source on the network where the node runs. For example, settings may require that all objects be converted from one erasure coding format to another by a particular date, may require conversion once the objects reach a certain age, may require conversion from replication to erasure coding for all objects over a period of time, may require that objects over a certain size be converted to erasure coding by a particular age or date (or gradually over a period of time), etc. Moreover, an administrator may input settings or commands dictating how and when conversion should occur for an object or objects.

In step 708 a triggering condition is detected for a particular object indicating that the object should be converted from one format to another. This triggering condition may be detected in many different manners. For example, the health processing module in the course of iterating over objects within the cluster will review the object metadata for a particular object when it touches that object. Should the condition be met (e.g., “convert to erasure coding by or at a particular date using this particular erasure coding format”) then the object will be converted as described below. Or, whenever an object is touched or accessed for whatever reason, its user metadata is reviewed to see if the triggering condition is met. Alternatively, the storage cluster itself periodically reviews its cluster settings and rules to determine whether a time or date has passed indicating that an object or objects should be converted according to the cluster settings. Of course, any manual input from the cluster administrator will be acted upon immediately and may indicate a trigger condition.

Step 712 results when the trigger condition indicates that the object (currently stored using replication) should be converted to erasure coding. The unique identifier for the object is obtained (from the object metadata, cluster settings, administrator input, or other) and the cluster determines a node on which a replica of the object exists. In step 716 this node reads the object from one of its disks into memory. Next, in step 720 the node writes the object to the cluster using the particular erasure coding format determined from user metadata, system metadata, cluster settings, or administrator input. This step may be performed as discussed above with reference to FIGS. 4A and 4B, specifically, steps 416-448. This new object written with erasure coding is supplied with the same unique identifier for its manifest as the unique identifier used for the original replica read in step 716. In step 724 the original object and any replicas may be deleted, either immediately or later by the health processing module. The health processing module determines that any replicas having a unique identifier with an earlier timestamp than the newly converted object may be deleted.

Step 732 results when the trigger condition indicates that the object (currently stored using erasure coding) should be converted to storage using replication. The unique identifier for the object is obtained (from the object metadata, cluster settings, administrator input, or other) and the cluster determines a node on which the manifest for the erasure-coded object exists. In step 736 this node reads the object from the cluster into memory. This step may be performed as discussed above with reference to FIG. 5, specifically, steps 516-532. Next, in step 740 this node writes the object as a continuous stream (rather than as erasure-coded segments) to a node of the cluster. This write may be performed, for example, by broadcasting a request for write bids throughout the cluster, and then writing a stream to the node with the winning bid. Or, other techniques may be used to choose a particular node or disk to which to write the object. This new object written as a single stream is supplied with the same unique identifier as the unique identifier used for the original manifest read in step 736. In step 744 the original object may be deleted, either immediately or later by the health processing module. The health processing module determines that any manifest (and its associated segments) having a unique identifier with an earlier timestamp than the newly converted object may be deleted. In step 748 the newly written object may be replicated to create any number of replicas within the cluster and this replication may occur immediately or over time as the health processing module iterates over this object.

Step 752 results when the trigger condition indicates that the object (currently stored using an old erasure coding) should be converted to a new erasure coding. The unique identifier for the object is obtained from the object metadata and the cluster determines a node on which the manifest for the erasure-coded object exists. In step 756 this node reads the object from the cluster into memory. This step may be performed as discussed above with reference to FIG. 5, specifically, steps 516-532. Next, in step 760 the node writes the object to the cluster using the new erasure coding format determined from user metadata, system metadata, cluster settings, or administrator input. This step may be performed as discussed above with reference to FIGS. 4A and 4B, specifically, steps 416-448. This object written with new erasure coding is supplied with the same unique identifier for its manifest as the unique identifier used for the original object read in step 756. In step 764 the original object may be deleted, either immediately or later by the health processing module. The health processing module determines that any manifest having a unique identifier with an earlier timestamp than the newly converted object may be deleted.

Management of Digital Objects Across Clusters

Another embodiment of the present invention is able to move a digital object from one storage cluster to another storage cluster, and convert the object to the format required by the new cluster or to that required in the object's user metadata. For example, an object stored using an erasure coding of 5:7 in a first cluster may be converted to a coding of 6:10 when moved to a second cluster, an erasure-coded object may be converted to storage using replication when moved to the second cluster, or an object stored using replication in a first cluster may be converted to storage using erasure coding when moved. Whether to convert an object to a different format may be dictated by the user metadata, storage cluster default settings, a combination of both, an instruction from an outside software product, or a cluster administrator. Advantageously, the unique identifier of the object in the first cluster is also retained for use within the second cluster.
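As a rough illustration of what such a conversion changes, the short sketch below compares the two codings named in the example. It assumes that a coding written as k:n means k data segments out of n total segments; that reading of the notation, and the helper names, are assumptions made for this sketch.

```python
from fractions import Fraction


def parse_coding(spec):
    """Parse a 'k:n' string, assumed to mean k data segments of n total."""
    k, n = (int(part) for part in spec.split(":"))
    if not 0 < k <= n:
        raise ValueError(f"invalid coding {spec!r}")
    return k, n


def storage_overhead(spec):
    """Stored bytes per byte of object data."""
    k, n = parse_coding(spec)
    return Fraction(n, k)


def tolerated_losses(spec):
    """Number of segments that may be lost while the object stays readable."""
    k, n = parse_coding(spec)
    return n - k


for spec in ("5:7", "6:10"):
    print(spec, float(storage_overhead(spec)), tolerated_losses(spec))
```

Under that assumption, moving from 5:7 to 6:10 raises the footprint from 1.4 to roughly 1.67 times the object size and raises the number of tolerated segment losses from two to four.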

FIG. 8 is a flow diagram describing one embodiment in which an object in one storage cluster may be moved to a second storage cluster and converted from one storage format to another. Although this diagram deals with a single object, any number of objects within a cluster may be moved using this technique. Preferably, the unique identifier for the object within the first cluster remains the same for the object in the second cluster so that a client application may retrieve the object using the same unique identifier from the second cluster. For example, the unique identifier may be a random number or a hash value of the object's name supplied by a client application.
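The two identifier options mentioned above might look like the following sketch. The 128-bit width, the use of SHA-256 truncated to 32 hexadecimal digits, and the function names are assumptions chosen for illustration; the text does not specify a hash function or identifier length.

```python
import hashlib
import uuid


def random_identifier():
    """A random 128-bit identifier, rendered as 32 hex digits."""
    return uuid.uuid4().hex


def name_identifier(name):
    """An identifier derived from a client-supplied object name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return digest[:32]   # truncated to 128 bits (an assumption of this sketch)


a = name_identifier("reports/2012/q2.pdf")
b = name_identifier("reports/2012/q2.pdf")
r1, r2 = random_identifier(), random_identifier()
```

A name-derived identifier is reproducible in any cluster from the name alone, whereas a random identifier must be carried along with the object when it is moved.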

In step 804 an instruction is produced to copy (or move) an object from a source cluster into a target cluster. The object may be copied into the target cluster for backup purposes (leaving the original in the source cluster) or the object may simply be moved into the target cluster and the original deleted. The instruction may originate from any outside software product, a client application, from within the cluster itself, or from a cluster administrator. In one embodiment, the “Content Router” software product available from Caringo Inc. is used to generate an instruction to replicate an object from a source cluster into a target cluster. The instruction includes an identification of the object to be replicated, such as its unique identifier within the source cluster.

The object may be copied from the source cluster in many different ways. For example, the source cluster may read the object and then “push” it to the target cluster, or the target cluster may “pull” the object from the source cluster. In one embodiment, in step 808 a target node is first selected in the target cluster to perform the write of the object within the target cluster. The target node may be selected randomly, by using a bid process, or by another technique. Once selected, the target node is provided with the unique identifier for the object to be copied and contact information for the source cluster. For example, the target node may be provided with a communication address for the source cluster overall, with an address of a central or coordinating node within the cluster, or preferably, an IP address of any node within the source cluster.

In step 812 any relevant conversion information is identified within the target cluster. For example, any default settings or rules that specify how the copied object should be stored within the target cluster (i.e., using replication or erasure coding) are identified. If no default settings are relevant then the conversion information may be taken from the user metadata contained within the object to be copied. Alternatively, the instruction to copy the object may include the conversion information.
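One possible way to resolve the conversion information of step 812 is sketched below. The text gives cluster defaults first and user metadata as the fallback; letting an explicit instruction take precedence over both is an assumption of this sketch, as are the function name and the `storage_format` key.

```python
def resolve_conversion(cluster_defaults, user_metadata, instruction=None):
    """Return the storage format for a copied object, or None if unspecified."""
    # Assumption: conversion information carried by the copy instruction wins.
    if instruction and instruction.get("storage_format"):
        return instruction["storage_format"]
    # Default settings or rules of the target cluster come next.
    if cluster_defaults.get("storage_format"):
        return cluster_defaults["storage_format"]
    # Otherwise fall back to the user metadata of the object being copied.
    return user_metadata.get("storage_format")


from_defaults = resolve_conversion({"storage_format": "erasure 6:10"},
                                   {"storage_format": "replication"})
from_metadata = resolve_conversion({}, {"storage_format": "replication"})
from_instruction = resolve_conversion({"storage_format": "erasure 6:10"}, {},
                                      {"storage_format": "replication"})
unspecified = resolve_conversion({}, {})
```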

In step 816 the object is copied from the source cluster to the target cluster. The target node initiates copying of the object by contacting any node of the source cluster using the IP address provided and provides the unique identifier for the object. The object may then be communicated from the source cluster to memory of the target node. This step may be performed, for example, as explained above with reference to FIG. 5, where the target node is acting as the client application. Once the target node receives the object (or as the target node is receiving the object) it will write the object into the target cluster using the appropriate conversion information that has been determined above. In other words, the object will be written as a continuous stream (replication) or will be written using erasure coding. For example, this write step may be performed as described above with reference to FIGS. 4A and 4B where the target node acts as the secondary access node. For replication, the target node may solicit bids from the other nodes within the target cluster, or may write the object to one of its own disks. In the case of erasure coding, the segments will be written to various nodes within the target cluster. Preferably, the copied object in the target cluster retains the same unique identifier it had in the source cluster. Once the object has been copied to the target cluster, it may be retained in the source cluster or deleted at a future time.
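The "pull" variant of step 816 can be sketched as follows. The `SimpleCluster` class and `pull_copy` function are hypothetical, network transfer is replaced by a direct method call, and the erasure-coded write is reduced to splitting the object into two data segments with parity omitted, so the sketch shows only the format conversion and the retained identifier.

```python
class SimpleCluster:
    """Hypothetical in-memory cluster: uid -> (format, payload)."""

    def __init__(self):
        self.objects = {}

    def write(self, uid, data, fmt):
        if fmt == "replication":
            self.objects[uid] = (fmt, data)            # one continuous stream
        elif fmt == "erasure":
            mid = (len(data) + 1) // 2
            # Stand-in for FIGS. 4A and 4B: data segments only, no parity.
            self.objects[uid] = (fmt, [data[:mid], data[mid:]])
        else:
            raise ValueError(f"unknown format {fmt!r}")

    def read(self, uid):
        fmt, payload = self.objects[uid]
        return payload if fmt == "replication" else b"".join(payload)


def pull_copy(source, target, uid, target_format, move=False):
    """Target-side copy: read by identifier, write in the target's format."""
    data = source.read(uid)                   # target node acts as the client
    target.write(uid, data, target_format)    # same unique identifier is kept
    if move:
        del source.objects[uid]               # original deleted when moving


source, target = SimpleCluster(), SimpleCluster()
source.write("obj-42", b"archived scan", "replication")
pull_copy(source, target, "obj-42", "erasure")
```

A client holding `obj-42` can read the object from either cluster without knowing which storage format each one uses.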

Computer System Embodiment

FIGS. 9A and 9B illustrate a computer system 900 suitable for implementing embodiments of the present invention. FIG. 9A shows one possible physical form of the computer system. Of course, the computer system may have many physical forms including an integrated circuit, a printed circuit board, a small handheld device (such as a mobile telephone or PDA), a personal computer or a super computer. Computer system 900 includes a monitor 902, a display 904, a housing 906, a disk drive 908, a keyboard 910 and a mouse 912. Disk 914 is a computer-readable medium used to transfer data to and from computer system 900.

FIG. 9B is an example of a block diagram for computer system 900. Attached to system bus 920 are a wide variety of subsystems. Processor(s) 922 (also referred to as central processing units, or CPUs) are coupled to storage devices including memory 924. Memory 924 includes random access memory (RAM) and read-only memory (ROM). As is well known in the art, ROM acts to transfer data and instructions uni-directionally to the CPU and RAM is used typically to transfer data and instructions in a bi-directional manner. Both of these types of memories may include any suitable type of the computer-readable media described below. A fixed disk 926 is also coupled bi-directionally to CPU 922; it provides additional data storage capacity and may also include any of the computer-readable media described below. Fixed disk 926 may be used to store programs, data and the like and is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. It will be appreciated that the information retained within fixed disk 926 may, in appropriate cases, be incorporated in standard fashion as virtual memory in memory 924. Removable disk 914 may take the form of any of the computer-readable media described below.

CPU 922 is also coupled to a variety of input/output devices such as display 904, keyboard 910, mouse 912 and speakers 930. In general, an input/output device may be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. CPU 922 optionally may be coupled to another computer or telecommunications network using network interface 940. With such a network interface, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. Furthermore, method embodiments of the present invention may execute solely upon CPU 922 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.

In addition, embodiments of the present invention further relate to computer storage products with a computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter.

Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Therefore, the described embodiments should be taken as illustrative and not restrictive, and the invention should not be limited to the details given herein but should be defined by the following claims and their full scope of equivalents.

We claim:
 1. A method of storing a digital object in a storage cluster, said method comprising: receiving from a client application a request at a computer node of said storage cluster to store said digital object; determining whether to store said digital object in said storage cluster using replication or erasure coding; when it is determined to store said digital object using erasure coding, writing said digital object to a plurality of computer nodes of said storage cluster using erasure coding, said digital object being stored as a plurality of segments; creating a manifest computer file that includes an indication of said erasure coding and a unique identifier within said storage cluster for each of said segments; storing said manifest computer file on a computer node of said storage cluster; and returning a manifest unique identifier identifying said manifest computer file to said client application.
 2. A method as recited in claim 1 further comprising: determining whether to store said digital object using replication or erasure coding by reference to an inherent property of said digital object, an instruction from said client application, or metadata of said digital object.
 3. A method as recited in claim 1 further comprising: replicating said manifest computer file within said storage cluster and not storing said manifest computer file using erasure coding.
 4. A method as recited in claim 1 further comprising: not replicating said digital object within said storage cluster.
 5. A method as recited in claim 1 further comprising: for each segment stored on a disk of said storage cluster, storing on said disk in association with said segment a unique identifier of a next disk that stores another segment of said digital object.
 6. A method as recited in claim 5 further comprising: storing said unique identifier of said next disk in association with said segment by storing said unique identifier in a journal entry for said segment on said disk.
 7. A method of reading a digital object from a storage cluster having a plurality of computer nodes, said method comprising: receiving, at one of said computer nodes within said storage cluster, a request from a client application that includes a unique identifier for said digital object; determining whether said digital object is stored within said storage cluster using replication or erasure coding; when it is determined that said digital object is stored using erasure coding, reading a manifest stored on a computer node of said storage cluster, said manifest being identified by said unique identifier; identifying a plurality of segments within said storage cluster using unique segment identifiers found within said manifest; reconstructing said digital object using said segments and an erasure coding algorithm; and returning said digital object to said client application.
 8. A method as recited in claim 7 further comprising: determining that said digital object is stored using erasure coding by reference to said manifest.
 9. A method as recited in claim 7 further comprising: determining that one of said segments is not present within said storage cluster; and regenerating said segment not present using other of said segments and an erasure coding algorithm.
 10. A method as recited in claim 7 further comprising: identifying a first disk where one of said segments is stored; and reading a disk identifier for a second disk where another of said segments is stored, said disk identifier being stored in association with said one of said segments on said first disk.
 11. A method as recited in claim 7 further comprising: identifying a second erasure set within said manifest, said second erasure set including a plurality of second unique segment identifiers; and reconstructing said digital object using said segments, a plurality of second segments identified by said second unique segment identifiers, and said erasure coding algorithm.
 12. A method as recited in claim 7 wherein said manifest is replicated within said storage cluster and wherein said manifest is not stored within said storage cluster using erasure coding.
 13. A method as recited in claim 7 wherein said digital object is not replicated within said storage cluster.
 14. A method of recovering from a failed disk, said method comprising: detecting within a storage cluster having a plurality of computer nodes that a first disk of one of said plurality of computer nodes has failed; scanning a persistent storage area of a second disk of said storage cluster to find a unique identifier of said failed disk, said unique identifier in association with a digital stream of said storage cluster; determining whether said digital stream is stored within said storage cluster using replication or erasure coding; when it is determined that said digital stream is stored using erasure coding, identifying a missing segment previously stored on said failed disk; locating a plurality of other segments within said storage cluster, said plurality of other segments including said digital stream; regenerating said missing segment previously stored on said failed disk using said plurality of other segments and an erasure coding algorithm; and storing said regenerated segment on a computer node of said storage cluster.
 15. A method as recited in claim 14 further comprising: scanning a metadata section of said digital stream to find unique identifiers for said plurality of other segments.
 16. A method as recited in claim 14 further comprising: scanning said persistent storage area of said second disk includes scanning a journal entry for said digital stream.
 17. A method as recited in claim 16 further comprising: determining whether said digital stream is stored using erasure coding by reference to said journal entry.
 18. A method as recited in claim 14 further comprising: identifying said missing segment previously stored on said failed disk before receiving a request from outside the storage cluster for a digital object that is stored within said storage cluster using erasure coding that includes said missing segment.
 19. A method as recited in claim 14 wherein said missing segment is not replicated within said storage cluster.
 20. A method of relocating a segment within a storage cluster, said method further comprising: identifying a segment on a first disk of a computer node within a storage cluster, said segment being one of a plurality of segments representing a digital object stored within said storage cluster; relocating said segment from said first disk to a second disk of said storage cluster, said second disk being identified by a second unique disk identifier; retrieving from metadata of said segment a unique identifier for a sibling segment of said plurality of segments, said sibling segment including in its metadata a first unique disk identifier for said first disk; locating said sibling segment within said storage cluster using said unique identifier; and replacing said first unique disk identifier for said first disk within said metadata of said sibling segment with a second unique disk identifier for said second disk, whereby said metadata of said sibling segment indicates said second disk to where said segment has been relocated.
 21. A method of converting a digital object within a storage cluster, said method comprising: storing said digital object within said storage cluster on a single disk of a computer node as a continuous stream of bits, said digital object having a unique identifier within said storage cluster; after said storing, identifying metadata of said storage cluster indicating a requirement to convert said digital object to an erasure coding storage format; reading said digital object from said single disk using said computer node of said storage cluster; writing said digital object to a plurality of disks of said storage cluster using said erasure coding storage format; and retaining said unique identifier for said digital object written in said erasure coding storage format, whereby a client application may retrieve said digital object written in said erasure coding storage format using said unique identifier.
 22. A method of converting a digital object within a storage cluster, said method comprising: storing said digital object within said storage cluster in a first erasure coding storage format, said digital object having a unique identifier within said storage cluster; after said storing, identifying metadata of said storage cluster indicating a requirement to convert said digital object to a second erasure coding storage format; reading said digital object from said storage cluster using a computer node of said storage cluster; writing said digital object to said storage cluster using said second erasure coding storage format; and retaining said unique identifier for said digital object written in said second erasure coding storage format, whereby a client application may retrieve said digital object written in said second erasure coding storage format using said unique identifier.
 23. A method of converting a digital object within a storage cluster, said method comprising: storing said digital object within said storage cluster in an erasure coding storage format, said digital object having a unique identifier within said storage cluster; after said storing, identifying metadata of said storage cluster indicating a requirement to convert said digital object to a storage format using replication; reading said digital object from said storage cluster using a computer node of said storage cluster; writing said digital object to a single disk of a computer node of said storage cluster as a continuous stream of bits; and retaining said unique identifier for said digital object written as said continuous stream of bits, whereby a client application may retrieve said digital object written as said continuous stream of bits using said unique identifier.
 24. A method of copying a digital object from a source storage cluster to a target storage cluster, said method comprising: receiving an instruction at a target node of said target storage cluster to copy said digital object from said source storage cluster to said target storage cluster, said instruction including an address of a source node of said source storage cluster; determining a target conversion format into which said digital object will be stored in said target storage cluster; reading said digital object from said source storage cluster, said digital object being stored in a source conversion format and having a unique identifier within said source storage cluster; and storing said digital object in said target storage cluster using said target conversion format, said digital object being stored using said unique identifier.
 25. A method as recited in claim 24 wherein said target conversion format is different from said source conversion format.
 26. A method as recited in claim 24 wherein said source conversion format is replication or erasure coding.
 27. A method as recited in claim 24 further comprising: determining said target conversion format by reference to the metadata of said digital object or by reference to a setting of said target storage cluster.
 28. A method as recited in claim 24 wherein said target storage cluster implements a storage format different from said source storage cluster.